This entry documents the first observability experiment I've run against the LLM Workflow Router project: attempting to instrument a deterministic, content-blind workflow enforcement layer using the emerging OpenTelemetry GenAI semantic conventions.

What made the experiment interesting is that the Router itself does not perform any LLM operations. It does not call a provider API, inspect prompts, generate completions, retrieve context, or execute tools. It evaluates structured metadata against a declared workflow topology and returns a terminal structural decision.

In practice, that meant the Router sat directly at the edge of the current semantic model. The instrumentation worked — but only by repeatedly stepping outside the vocabulary the conventions currently provide.

The system under instrumentation

The LLM Workflow Router is a stateless middleware engine that enforces explicit execution topology in AI systems relying on large language models. It evaluates interaction metadata against strictly declared workflow rules and returns a terminal state.

A quick disambiguation

The term “LLM router” is overloaded in this ecosystem. Most public projects under that name are model-selection routers — systems that pick between cheap and expensive models based on query complexity, task classification, or cost.

This Router is not that. It is a topology-enforcement layer: it validates whether a given workflow transition is permitted, independent of model choice or content. The two share a name but solve different problems.

Its responsibilities are intentionally narrow:

Importantly, the Router is deliberately content-blind. It does not observe prompts or model outputs. It observes only metadata and workflow structure.
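
As a rough illustration of what "content-blind" means in practice, here is a minimal sketch of a metadata-only evaluation. It is not the Router's actual code: the `TOPOLOGY` table, the `Decision` type, the state names ("permitted", "refused"), and the reason strings are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical topology: each container maps to the actions it permits.
TOPOLOGY = {
    "intake": {"classify"},
    "review": {"approve", "reject"},
}


@dataclass(frozen=True)
class Decision:
    state: str                    # terminal state (assumed names: "permitted" / "refused")
    reason: Optional[str] = None  # typed refusal reason; None when permitted


def evaluate(container: str, requested_action: str) -> Decision:
    """Decide from metadata alone; no prompt or completion is ever passed in."""
    if container not in TOPOLOGY:
        return Decision("refused", "unknown_container")
    if requested_action not in TOPOLOGY[container]:
        return Decision("refused", "action_not_permitted")
    return Decision("permitted")


allowed = evaluate("review", "approve")
refused = evaluate("intake", "approve")
```

The function signature is the point: nothing resembling user content can reach the decision, because there is no parameter for it.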

Instrumentation methodology

Two operations were wrapped in OpenTelemetry spans: validate_workflow_config() and WorkflowEngine.evaluate().

Existing GenAI semantic conventions were used wherever possible. Every mismatch between the Router's semantics and the available conventions was recorded as a friction point.

Friction points observed

F1. No operation name exists for structural validation

The first mismatch appeared immediately during configuration loading.

The Router validates workflow topology before execution begins, but the well-known gen_ai.operation.name values (chat, text_completion, embeddings, execute_tool, and similar) all describe runtime inference activities.

None describe static structural validation.

The only accurate representation was a non-standard value:

gen_ai.operation.name = "validate_workflow_config"

This exposed a deeper assumption embedded in the conventions: that meaningful GenAI observability begins only once runtime model activity begins.

For deterministic orchestration systems, much of the actual safety work happens before execution ever starts.
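
To make "static structural validation" concrete, here is a small sketch of the kind of check that can run before any workflow executes, together with the attributes such a span might carry. The config shape (container mapped to its permitted target containers) and the `workflow.config.problem_count` attribute are assumptions for illustration; only the operation name comes from the experiment itself.

```python
from typing import Dict, List


def validate_workflow_config(config: Dict[str, List[str]]) -> List[str]:
    """Return structural problems found before any workflow runs.

    `config` maps each container to the containers it may transition to
    (a hypothetical shape chosen for this sketch).
    """
    problems = []
    for container, targets in config.items():
        for target in targets:
            if target not in config:
                problems.append(f"{container} -> {target}: undeclared container")
    return problems


def validation_span_attributes(problems: List[str]) -> Dict[str, object]:
    """Attributes for the validation span; with the OpenTelemetry Python API
    these would be attached via span.set_attribute(key, value)."""
    return {
        # Non-standard value: no well-known operation name fits.
        "gen_ai.operation.name": "validate_workflow_config",
        # Hypothetical custom attribute, not part of any convention.
        "workflow.config.problem_count": len(problems),
    }


config = {"intake": ["review"], "review": ["archive"]}
problems = validate_workflow_config(config)
attrs = validation_span_attributes(problems)
```

Here "archive" is referenced but never declared, so the check reports one problem without a single model call having happened.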

F2. gen_ai.provider.name is structurally inapplicable

The Router has no provider.

It does not call OpenAI, Anthropic, Gemini, local inference servers, or any external model endpoint.

Yet the conventions strongly imply that a provider exists for all GenAI-related spans.

Both spans omitted the attribute entirely, which leaves them technically incomplete relative to the current semantic model.

Assuming a provider is fine when every GenAI-related span sits inside a model-invocation context. But it presumes the category of system in scope: model invokers. A different category (orchestration, governance, and policy-enforcement layers that sit adjacent to models rather than calling them) has no clear home in the current conventions, and this Router is one instance of it.

Comparable systems include LangGraph's graph-execution layer, Temporal-style workflow engines wrapping LLM steps, and guardrail frameworks like NVIDIA NeMo Guardrails that gate execution without generating content. Each shares the property of being structurally involved in AI execution while being inferentially uninvolved.

F3. Workflow topology has no first-class semantic vocabulary

The Router evaluates transitions between containers in a workflow topology.

To instrument that meaningfully, I needed to record the container being evaluated, the action requested, and the resulting state.

No current gen_ai.* attribute represents those concepts.

The closest available attribute, gen_ai.conversation.id, describes conversational identity rather than structural execution position.

Using conversation semantics for topology semantics would have produced misleading telemetry, so custom attributes were introduced instead:

workflow.container
workflow.requested_action
workflow.result.state

This was one of the clearest points where the semantic model showed its conversational bias.
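
A minimal sketch of how those three attributes might be assembled for an evaluation span. The helper name and its parameters are hypothetical; the attribute keys are the custom ones listed above. The optional `provider` parameter shows the omission described under F2: the key is simply left out when no provider exists.

```python
from typing import Dict, Optional


def evaluation_attributes(
    container: str,
    requested_action: str,
    result_state: str,
    provider: Optional[str] = None,
) -> Dict[str, str]:
    """Build span attributes for one workflow evaluation (illustrative helper)."""
    attrs = {
        "workflow.container": container,
        "workflow.requested_action": requested_action,
        "workflow.result.state": result_state,
    }
    # The Router calls no model, so it passes no provider and the key is
    # omitted rather than filled with a placeholder value.
    if provider is not None:
        attrs["gen_ai.provider.name"] = provider
    # gen_ai.conversation.id is deliberately not used: it identifies a
    # conversation, not a position in a workflow topology.
    return attrs


router_attrs = evaluation_attributes("review", "approve", "permitted")
```

Keeping the topology attributes in their own `workflow.*` namespace means a reader of the telemetry never has to guess whether an ID refers to a chat thread or a structural position.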

F4. Refusals are structurally meaningful, but semantically under-described

The Router emits typed structural refusal reasons:

These are not operational failures. They are deliberate enforcement outcomes.

The existing semantic conventions contain:

error.type

But that attribute describes failures of execution — timeouts, API errors, transport problems, and exceptions.

A workflow refusal is different.

The Router is operating correctly when it refuses an invalid transition. Refusal is a successful enforcement outcome, not an error state.

The only workable solution was another custom attribute:

workflow.result.reason
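
The distinction can be sketched as follows: a refusal is recorded as a result, and error.type is reserved for cases where the evaluation itself broke. The function name, the `(state, reason)` return shape, and the example reason string are assumptions for this sketch, not the Router's real interface.

```python
from typing import Callable, Dict, Optional, Tuple


def outcome_attributes(
    evaluate: Callable[[], Tuple[str, Optional[str]]]
) -> Dict[str, str]:
    """Map one evaluation to span attributes, keeping refusals out of error.type."""
    try:
        state, reason = evaluate()
    except Exception as exc:
        # A genuine failure of execution: record the exception class name.
        # Real instrumentation would normally re-raise after recording.
        return {"error.type": type(exc).__qualname__}
    attrs = {"workflow.result.state": state}
    if reason is not None:
        # A refusal is a successful enforcement outcome, so it gets a reason,
        # not an error.
        attrs["workflow.result.reason"] = reason
    return attrs


def refuse() -> Tuple[str, Optional[str]]:
    return ("refused", "action_not_permitted")


def crash() -> Tuple[str, Optional[str]]:
    raise ValueError("malformed metadata")


refusal_attrs = outcome_attributes(refuse)
failure_attrs = outcome_attributes(crash)
```

With this split, a dashboard filtering on error.type shows only real breakage, while refusals can be counted separately by reason.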

F5. The conventions assume content observability as the primary mode

One of the most interesting frictions was not a missing attribute, but a missing category of system entirely.

The current GenAI semantic conventions are heavily centered around prompts, completions, tool calls, and retrieved context.

The Router observes none of those things by design.

Its observability stance is intentionally structural rather than semantic. It traces metadata, topology, transition legality, and enforcement outcomes while remaining blind to user content entirely.

That distinction matters for both privacy and system architecture.

Current conventions implicitly treat content observability as the default shape of GenAI instrumentation. The Router demonstrated that another class of AI-adjacent systems exists: systems whose primary concern is structural governance rather than inference visibility.

F6. Standalone workflow evaluation has no clear trace-parenting model

Each Router evaluation generated its own isolated trace:

parent_id: null

This happened because there was no upstream model invocation span to inherit context from.

The current conventions strongly imply that workflow spans typically exist as children of model or agent execution spans.

But deterministic workflow enforcement can exist independently of any model runtime entirely.

In practice, this meant there was no obvious semantic guidance for how configuration validation spans and subsequent workflow evaluations should relate to one another structurally.

The result was observability fragmentation: every decision became an isolated trace instead of part of a larger structural lifecycle.

The deeper issue underneath the frictions

Individually, each mismatch seems small.

Together, though, they point toward a broader pattern: the current semantic conventions implicitly model GenAI systems primarily as systems of generation.

Prompts go in. Completions come out. Tools are called. Retrieval augments context.

That is a valid model for a large portion of the ecosystem. But it is incomplete for a class of systems that already exists: layers that participate in AI execution pipelines without themselves generating anything.

These systems still need observability, but their semantics are structural rather than conversational.

The current conventions can technically accommodate them through custom attributes, but only awkwardly and inconsistently — as the six friction points above illustrate.

That awkwardness is useful.

Friction logs are valuable precisely because they reveal the shape of the assumptions hidden inside a specification.

In this case, the experiment suggests that the GenAI semantic conventions may eventually need a clearer distinction between systems that perform inference and systems that structurally govern the execution around it.

Right now those categories are blurred together under a model-centric view of AI systems.

The Router made the edges of that model visible.

Why this matters to me

One reason I care about observability work is that I increasingly think legibility is one of the central engineering problems of the AI era.

Not just model interpretability in the academic sense, but operational legibility:

Deterministic workflow systems are one attempt to make AI behavior more structurally accountable.

Observability is what makes those structures visible once they exist.

Each of the six frictions above is, in its own small way, a place where that visibility currently breaks: a structural validation step that has no name, a refusal that gets misread as an error, a transition that has no vocabulary, a governance trace that floats free of any parent. Each is a place where the system did the right thing and the telemetry could not quite say so.

That's part of why this friction log felt worth writing down. The interesting thing wasn't that the instrumentation failed — it mostly worked.

The interesting thing was what the points of friction revealed about how the ecosystem currently imagines AI systems in the first place.

And for an “in development” specification, that's exactly the kind of edge case worth exploring early. 🌙

Status: Experimental instrumentation complete · semantic convention analysis in progress · possible future proposal work depending on further exploration.